Text File | 1992-03-17 | 13KB | 365 lines

Word Check Module (Formerly SpellChk)
==================

RiscOS 3 version 0.03
© Geoff. Lane. Mar 1992

Internet: zzassgl@uts.mcc.ac.uk
Janet   : zzassgl@uk.ac.mcc.uts

(The information given here may not exactly match the current state
of the module.)

Introduction
------------

This implements a word spelling check module for British/American
technical English (with a selection of Arc specific words added for
good luck) or other (unsupplied) languages. It is based on a
description in Chapter 13 of "Programming Pearls" by Jon Bentley
(ISBN 0-201-10331-1) of a clever algorithm devised by Doug McIlroy
in 1978.

Many RiscOS based word processors contain a spelling checker; each
has its own word list and interface. These checkers are impossible
to use outside the application. For each application there will be
one or more large dictionary files, each unique to the application.

If a standard interface were created using SWIs it would be possible
for one module to provide a word check function to all applications
that required such a facility. If an application could rely on a
Word check module being available then the application would be
smaller. The facilities provided within this module are a first
attempt to define such an interface.

This program was created and tested on a 2M A5000 machine running
RiscOS 3. The compiler used was Norcroft C version 3.0. The shared
library version used was 3.87. (I haven't included any RMEnsure
commands in the !Boot or !Run files - I'm not aware of any version
dependencies.) If you don't have the C library in ROM then you
should edit the files to check for and load the shared library.

Algorithm
---------

The algorithm allows a dictionary to be highly compressed by encoding
each word as a unique 32 bit number. The resulting list of numbers
is sorted and then a table of differences is created. This table of
16 bit numbers is included in the module.
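
The encode, sort and difference steps can be sketched in a few lines.
This is only an illustration in Python, not the module's C code: the
encoding function, the names (encode, build_table, contains_code,
contains) and the constants CODE_RANGE and MAX_DIFF are all
assumptions made for the example. The real hash is not described in
this README. The code range is kept deliberately small so that every
gap fits in 16 bits even with a handful of words.

```python
# Toy sketch of the difference-table scheme. The encoding function and
# the constants are illustrative assumptions, not the module's real hash.
CODE_RANGE = 1 << 16   # assumed small range so every gap fits in 16 bits
MAX_DIFF = 1 << 16     # one table entry must be below this value


def encode(word):
    """Hypothetical encoding: a polynomial hash reduced to CODE_RANGE."""
    code = 0
    for ch in word:
        code = (code * 31 + ord(ch)) % CODE_RANGE
    return code


def build_table(words):
    """Encode, sort, drop duplicate codes, then store gaps between neighbours."""
    codes = sorted({encode(w) for w in words})
    table = []
    previous = 0
    for code in codes:
        diff = code - previous
        if diff >= MAX_DIFF:
            raise ValueError("gap too large for a 16 bit entry")
        table.append(diff)
        previous = code
    return table


def contains_code(table, target):
    """Sum the differences until the running total reaches or passes target."""
    total = 0
    for diff in table:
        total += diff
        if total == target:
            return True
        if total > target:
            return False
    return False


def contains(table, word):
    """A word is accepted when its code can be produced by summing the table."""
    return contains_code(table, encode(word))


words = ["syzygy", "module", "dictionary", "spelling", "check"]
table = build_table(words)
print(all(contains(table, w) for w in words))
```

Any word whose code happens to equal the code of a listed word is
accepted as well, which is where the small rate of missed bad
spellings comes from. Summing from the start of the table on every
lookup is the simplest possible approach; how the module itself
speeds this up is not stated here.
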
To find a word in the table you encode the word and then see if it is
possible to generate the same value by summing the contents of the
table - if you get a match then the word is valid. (Of course you
have to choose the encoding algorithm quite carefully to ensure that
the vast majority of words translate to unique numbers (in this case
only 19 words hash to the same value) and that the difference
between each pair of sorted numbers is always less than 32K.)

Thus a module of about 71K bytes can be used to check the spelling
of about 32000 different words.

Building Word Lists
-------------------

The word lists were created from a number of sources...

* Take 1.5 Gbytes of Netnews.
* Take the Brown University Corpus of English Usage.
* Take the Unix online manual pages.
* Take the RiscOS help text.

Delete the non-alphabetics, sort and delete duplicates. From this
you obtain a huge list of words plus a lot of junk. You pass this
list through a standard spelling checker and then check the reject
list for words that are useful but not accepted.

Features
--------

* Installed as a module.

* Large dictionary but small memory requirements. (To an old BBC B
  programmer the fact that I can even contemplate dedicating over
  64K of memory to a utility is slightly obscene.)

* General purpose. The spelling check is available to any program
  (or module) and not restricted to a particular application.

* Multiple language support. British English, {work in progress
  American English} are supplied. The language to be used in a given
  run can be selected by command. The (changeable) default is
  British English.

* Fairly fast. On an A5000 the current version could check about
  500 words/second when running a test program within a Task Window
  and about 620 wps running "native". (It is interesting to know
  that in the description of the algorithm in the book mentioned
  above the speed is described as about 170 wps on a VAX 11/750 -
  this was considered fast at the time the book was written!
  The VAX version was just under 64K in size - the dictionary was a
  bit smaller.)

Bugs and/or Misfeatures
-----------------------

(It's not as bad as it looks. The module is only intended to
implement basic spelling checks; clever preprocessing should be done
by the application program and not set in stone within the module.)

* Does not perform pre- or post-fix stripping.

* Can't cope with many plurals (a special case of the lack of
  post-fix stripping).

* Complains about what it believes to be incorrectly capitalised
  words. (Many "standard" capitalisations are encoded in the word
  list.)

* Does not check single character words (i, a, ...).

* Currently not possible to supply a personal dictionary to be added
  to the standard pre-loaded dictionary.

* The algorithm used can miss a small proportion of bad spellings.
  (About 1 in 1000 misspelled words will get through.) This is a
  result of the way that the words are encoded -- the error rate
  could be reduced at the cost of a larger word table but then the
  major advantage of having a small module size (and thus speed) is
  lost.

* Anagram solver will be limited and quite slow.

* Word finder is limited and quite slow. It performs a brute force
  search over all possible words that fit the supplied partial word.
  Most of the possibilities are incorrect spellings so it pushes the
  checking algorithm to its limits and thus reports more incorrect
  words than it should.

* Difficult to use as a spelling corrector. The module cannot
  suggest close matches to a supplied word as the encoding algorithm
  generates unrelated hash values for similar words; in addition the
  original word list is not available to the module at run-time.

Configurable Bits
-----------------

* The default dictionary language used when the module is loaded can
  be changed by altering the "WordChk$DefLang" environment variable
  in both !Run and !Boot files.

* New languages can be installed by adding the encoded dictionaries
  to the "Languages" sub-directory. They can then be specified as
  the default language or loaded with the *WordLoad command.

Command Interface
-----------------

This allows a single word entered from the command line to be checked
against the current dictionary.

    *WordCheck <word>
    ok/unknown

{Work in progress} This treats the supplied word as an anagram and
tries to rearrange it into words that are found in the current
dictionary.

    *WordGram <word>

This takes a word with missing characters (indicated by ?'s in the
string) and tries to find matching words in the current dictionary.

    *WordFind <partial word>

This loads a new language as the current dictionary. At the moment
valid languages are "British" {work in progress, "American" and
"Technical".}

    *WordLoad <language>

Program Interface
-----------------

The module provides the following SWIs...

"WordCheck_Word"

  Input
    R0  pointer to string to test (null byte terminated character
        string as generated by BASIC V or C)

  Output
    R0  preserved
    R1  returns boolean (-1/TRUE or 0/FALSE)

  BASIC Example

    SYS "WordCheck_Word","syzygy" TO ,valid%

  returns valid% = -1/TRUE (honest) whereas

    SYS "WordCheck_Word","pointer" TO ,valid%

  returns valid% = 0/FALSE (shame!)

"WordCheck_Find"

  Input
    R0  pointer to string to test with up to three '?' characters
        indicating unknown characters (null byte terminated
        character string as generated by BASIC V or C)

  To obtain further possible matches use "WordCheck_FindNext".

  Output
    R0  preserved
    R1  returns first found word or null string if nothing found.

  BASIC Example

    SYS "WordCheck_Find","te?t" TO ,match$

  returns match$ = "teat"

"WordCheck_FindNext"

  Output
    R1  returns next matching word or null string if nothing found.

  BASIC Example

  This assumes that "WordCheck_Find" has been called with an initial
  partial word of "te?t".

    SYS "WordCheck_FindNext" TO ,match$

  returns match$ = "tent"

  Further matches can be obtained by more calls to
  "WordCheck_FindNext" until the end of all possible matches is
  indicated by the return of a null string.

  For instance, in BASIC, to find all matches use code similar to...

    SYS "WordCheck_Find","te?t" TO ,m$
    WHILE m$ <> ""
      PRINT m$
      SYS "WordCheck_FindNext" TO ,m$
    ENDWHILE

"WordCheck_Load"

  Input
    R0  pointer to string holding language name to load (null byte
        terminated character string as generated by BASIC V or C.)
        The corresponding named language file must be present in the
        Languages sub-directory within !WordChk.

  Output
    R0  preserved
    R1  returns -1/TRUE if successful otherwise 0/FALSE.

  BASIC Example

    SYS "WordCheck_Load","British" TO ,ok%

  returns ok% = -1/TRUE if language "British" has been loaded.
  returns ok% = 0/FALSE if failed to load new language.

Building New Dictionary Files
-----------------------------

[[[ NOT IN THIS VERSION ]]]

A program, BuildDict, is provided which can create new encoded
dictionary files from word lists. These files can then be loaded
into the module.

To create a new dictionary you need to do the following...

* Gather a word list. There must be at least 256 words in the list
  and there will probably have to be many more words in order that
  the difference between the hash values is always < 64K. There
  should not be more than 33000 words in the list.

* Delete single character words and ensure that there are no leading
  or trailing spaces or tabs around the words. There should only be
  one word per line.

* Sort the list (not essential for BuildDict but needed for the
  following step.)

* Delete duplicate words.

* Place the file in the WordLists sub-directory of !WordChk.

* Run the BuildDict program. This will create, if successful, an
  encoded dictionary file in the Languages sub-directory. There are
  a number of possible fatal errors that may occur during
  processing.

* Change the WordChk$DefLang value set in !Boot and !Run to make
  your new language the default or use the *WordLoad command to load
  the new language into a running module.

The hash algorithm has been optimised for UK/US English. It may not
be suitable for other languages. A future version of !WordChk may
include a means to alter and re-optimise the hash algorithm if
necessary for each language to be loaded.

Foreign Languages
-----------------

True spelling checkers for foreign languages are complicated by the
fact that most of them care about the 'sex' of the words. Some of
them are so regular that native writers rarely make spelling errors
other than simple 'typing' errors, i.e. transposition of characters.

Some languages insist on strange characters that do not appear on
the keyboard. The Arc copes quite well with the strange characters
for languages such as French and German. Languages such as
Esperanto are not so well provided for as the accents appear on
unexpected characters and special provision would have to be given
to defining them.

In any case WordChk is just a word checker and not a full spelling
checker (the difference is that one just tries to match a word to
one in a list, the other attempts to manipulate the word in various
ways to attempt to find the root.)

  British     {word list being repaired}
  American    {word list being repaired}
  Computing   {work in progress}
  Hacking     {work in progress}
  French      need smaller word list!
  German      need word list.
  Italian     need word list.
  Esperanto   can't display accented characters from default font.
  Latin       need word list (getting a bit weird here?)

=====================================================================

The legal bit:

I don't care what kind of ...ware it is called but I retain
copyright on the code and encoded dictionary table used in this
particular RiscOS implementation of the spelling checker algorithm.
You can distribute version 0.03 of the WordCheck module and
associated files as far and wide as you wish so long as this README
file is also distributed with the module and hash file.

You may include the !WordChk application (or just the language,
Wordchk module and README files) within another application that
makes use of its facilities.

If you paid money (other than a small amount for disc duplication
and postage) for these files then you've been ripped off.

As noted above, this code is in alpha test and you take your own
chances with bugs, spelling errors etc.

======================================================================